https://github.com/villetaro/IODS-project
Moving forward with analysis!
learning2014 <- read.table(file="http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", header=T, sep=",")
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The data has 166 observations and 7 variables: gender, age, exam points, and attitude towards statistics, plus three sum variables formed by combining questions that measure the “strategic approach”, “deep approach”, and “surface approach” to learning.
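As an illustration of how such a sum variable is formed, the sketch below averages a few toy Likert items; the column names D1-D3 are made up for the example, not the actual questionnaire columns.

```r
# Illustration only: a sum variable such as "deep" is the row-wise mean of its
# related Likert-scale items. The columns D1-D3 here are hypothetical toy data.
toy <- data.frame(D1 = c(4, 2), D2 = c(3, 3), D3 = c(4, 4))
toy$deep <- rowMeans(toy[, c("D1", "D2", "D3")])
toy$deep  # 3.666667 3.000000
```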
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
head(learning2014)
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
There are about twice as many females as males. Age ranges from 17 to 55 with a mean of 25.51, giving a wide range on the age scale. The points variable has a mean of 22.72, a minimum of 7.00, and a maximum of 33.00.
Now we can also make a graphical overview of the data with ggplot2.
# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
p <- ggpairs(learning2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
The graphical overview could have been a little better. I decided to highlight gender, as I think it is the most important variable in the model.
From the graph we can read the correlations between the variables. The highest correlation is between attitude and points. The lowest is roughly a tie between age-attitude and age-points; overall, age has quite low correlations compared to the other variables.
malli <- lm(points ~ attitude + stra + surf, data=learning2014)
summary(malli)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Attitude is the only variable with a clearly significant p-value, so we remove the stra and surf variables from the model and fit it again.
malli <- lm(points ~ attitude, data=learning2014)
summary(malli)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The multiple R-squared is the usual R^2, which can be interpreted as the fraction of the variance of the outcome variable explained by the explanatory variables. Here R^2 is 0.1906, which is quite low, so the model does not explain exam points particularly well.
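As a sanity check on that interpretation, R^2 can be reproduced by hand from the residual and total sums of squares (a sketch assuming `malli` and `learning2014` are still in the workspace):

```r
# R^2 = 1 - SSE/SST: the share of the variance of points explained by the model.
sse <- sum(residuals(malli)^2)
sst <- sum((learning2014$points - mean(learning2014$points))^2)
1 - sse / sst  # should match the Multiple R-squared, about 0.1906
```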
library(ggplot2)
plot(malli, which=c(1, 2,5), col = "turquoise2", lwd = 3)
The “Residuals vs Fitted” plot shows whether the residuals have non-linear patterns. Comparing this picture with the model's assumptions, we seem to have a few outliers; otherwise everything looks good.
The “Normal Q-Q” plot shows whether the residuals are normally distributed, i.e. how well the theoretical quantiles correspond to the observed ones.
The “Residuals vs Leverage” plot helps us spot influential outliers.
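The points flagged in the “Residuals vs Leverage” plot can also be listed numerically; a small sketch using Cook's distance on the fitted model:

```r
# Cook's distance measures how much each observation influences the fit;
# the largest values correspond to the points labelled in the diagnostic plots.
cooks <- cooks.distance(malli)
head(sort(cooks, decreasing = TRUE), 3)
```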
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
alc <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", header=T, sep=",")
dim(alc)
## [1] 382 35
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The data has 382 observations and 35 variables after the following changes:
- The variables not used for joining the two datasets were combined by averaging (including the grade variables).
- ‘alc_use’ is the average of ‘Dalc’ and ‘Walc’.
- ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise.
The data is about students’ alcohol consumption.
I’m going to conduct an analysis on relationships between high/low alcohol consumption and sex, age and the grades G1, G2 and G3.
My hypotheses were as follows:
H1: Lower grades lead to higher alcohol consumption.
H2: Male students drink more than female students.
H3: Age has little effect.
Next I draw the plots and see how my hypotheses begin to look.
library(ggplot2)
p1 <- ggplot(alc, aes(x = high_use, y = G1, col = sex)) + geom_boxplot() + ylab("first period grade")
p2 <- ggplot(alc, aes(x = high_use, y = G2, col = sex)) + geom_boxplot() + ylab("second period grade")
p3 <- ggplot(alc, aes(x = high_use, y = G3, col = sex)) + geom_boxplot() + ylab("final grade")
p4 <- ggplot(alc, aes(x = high_use, y = age, col = sex)) + geom_boxplot() + ylab("age")
# multiplot() is not in base R; it is a helper defined in the course materials
# (gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2) is an alternative)
multiplot(p1, p2, p3, p4, cols = 2)
We begin from the top left plot. Grades run from 0 to 20 and high alcohol use is binary, shown here as FALSE and TRUE. First period grades are a little better in the FALSE (low use) group. The whiskers are also long, which points to outliers in the data.
The second plot, in the bottom left corner, tells much the same story, but the y-axis is now the second period grade. Interestingly, the lower tails reach far down, with dots at grade 0, which again means outliers.
The third plot, in the top right corner, compares the final grade with alcohol consumption. We don't really see any large differences, so let's move on.
The fourth and final plot, in the bottom right corner, is very interesting. First of all, the ages in the high-use group are quite low, for example compared to Finland: the typical age is about 16 in the female group and 17 in the male group. That makes it worth looking at age across the whole data, which looks like this:
summary(alc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 16.00 17.00 16.59 17.00 22.00
Now let's build a model based on the hypotheses above. I try to explain high alcohol use with the grades, sex, and age, using the logistic regression model below.
model2 <- glm(high_use ~ G1 + G2 + G3 + sex + age, family = "binomial", data = alc)
summary(model2)
##
## Call:
## glm(formula = high_use ~ G1 + G2 + G3 + sex + age, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5220 -0.8417 -0.6686 1.1570 2.1413
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.895489 1.807142 -2.156 0.0311 *
## G1 -0.178473 0.073996 -2.412 0.0159 *
## G2 0.004475 0.088770 0.050 0.9598
## G3 0.081996 0.061465 1.334 0.1822
## sexM 0.962330 0.239154 4.024 5.72e-05 ***
## age 0.212090 0.103040 2.058 0.0396 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 433.69 on 376 degrees of freedom
## AIC: 445.69
##
## Number of Fisher Scoring iterations: 4
Now let's look at the results of the model. Sex is clearly significant, with a very low p-value; G1 and age are only borderline significant at the 5% level, and G2 and G3 are not significant at all. One possible explanation is that the relationship could also run the other way, with high alcohol use explaining grades. That is a valid point, but a model in that direction is not easy to set up with logistic regression here, so we leave it for another time.
Now we calculate the odds ratios and their confidence intervals.
OR <- coef(model2) %>% exp
CI <- confint(model2) %>% exp
## Waiting for profiling to be done...
cbind(OR,CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.02033343 0.0005560289 0.6762128
## G1 0.83654645 0.7201998351 0.9632938
## G2 1.00448490 0.8433762954 1.1972492
## G3 1.08545156 0.9673539597 1.2340260
## sexM 2.61778817 1.6469686500 4.2128194
## age 1.23625854 1.0123920001 1.5178536
Looking at the odds ratios, the first thing we see is that high use is about 2.6 times as likely for males as for females. The grade variables G1, G2, and G3 all have odds ratios close to 1, with G1 slightly below it.
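The odds ratios above are simply the exponentiated logistic regression coefficients, which is why the `%>% exp` step was used; for example (assuming `model2` from above):

```r
# A coefficient in a logistic model is a log-odds; exp() turns it into an
# odds ratio. This reproduces the sexM odds ratio of about 2.62.
exp(coef(model2)["sexM"])
```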
I decided to leave out G2, G3, and age, as they were not clearly significant in the model, and then fitted a new model below. G1 could also have been left out, but I wanted to keep one grade variable since I started with all three.
model3 <- glm(high_use ~ G1 + sex, family = "binomial", data = alc)
summary(model3)
##
## Call:
## glm(formula = high_use ~ G1 + sex, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3037 -0.8390 -0.6862 1.2508 2.0911
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.39990 0.39690 -1.008 0.31368
## G1 -0.09263 0.03590 -2.580 0.00987 **
## sexM 0.96992 0.23735 4.086 4.38e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 440.48 on 379 degrees of freedom
## AIC: 446.48
##
## Number of Fisher Scoring iterations: 4
The result is quite different now compared to the first model: G1 is significant at the 1% level (two stars), so it looks a lot more significant than before.
The intercept also rose from -3.895489 to -0.39990, which is a huge change.
OR <- coef(model3) %>% exp
CI <- confint(model3) %>% exp
## Waiting for profiling to be done...
cbind(OR,CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.6703900 0.3065190 1.457539
## G1 0.9115336 0.8484642 0.976978
## sexM 2.6377259 1.6654805 4.230000
A second look at the odds ratios shows much the same picture as the first one. Sex still has the strongest predictive power.
I drew a crosstab of the predictions. It shows that the model predicted low use correctly in about 69% of all cases and high use correctly in only about 3%; it predicts low use about 96% of the time, so it rarely flags high use at all.
probabilities <- predict(model3, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probabilities > 0.5)
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.69109948 0.01570681 0.70680628
## TRUE 0.26439791 0.02879581 0.29319372
## Sum 0.95549738 0.04450262 1.00000000
Then I calculated the penalty/loss rate, which tells how often the model classified observations incorrectly. The average share of incorrect predictions was 0.2801047.
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2801047
The last step was to cross-validate my model on randomly allocated parts of the alc data, using 10-fold cross-validation. The average share of wrong predictions is around 0.29, which is slightly higher than the figure I got above.
library(boot)
# K-fold cross-validation
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model3, K = 10)
# average number of wrong answers
cv$delta[1]
## [1] 0.2879581
We begin this week by analysing data from R's MASS package. The data contains information about house prices in the Boston area.
library(MASS)
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The data has 506 observations and 14 different variables. Next we plot the data to show the graphical overview and then we check how different variables compare to each other.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
This week we focus on the crim variable.
cor_matrix<-cor(Boston)
cor_matrix %>% round(digits = 2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
library(corrplot)
wb <- c("white","black")
corrplot(cor_matrix, order="hclust", addrect=2, bg="gold2", col=wb)
This is a correlation plot, which shows the pairwise correlations between the variables. The bigger the spot, the stronger the correlation.
One of the largest positive correlations is rad - tax. On the negative side, the largest are dis - indus, dis - nox, and dis - age.
Next we standardize the data.
myboston_scaled <- scale(Boston)
summary(myboston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
After standardizing, every variable now has mean 0 (and standard deviation 1).
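We can verify what `scale()` did on a single column by standardizing it by hand (a sketch assuming `Boston` and `myboston_scaled` are still in the workspace):

```r
# scale() subtracts the column mean and divides by the column standard
# deviation, so the manual version should match exactly.
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(as.numeric(myboston_scaled[, "crim"]), manual)  # TRUE
```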
myboston_scaled <- as.data.frame(myboston_scaled)
cutoffs <- quantile(myboston_scaled$crim)
labels <- c("low", "med_low", "med_high", "high")
crime_category <- cut(myboston_scaled$crim, breaks = cutoffs, include.lowest = TRUE, labels = labels)
table(crime_category)
## crime_category
## low med_low med_high high
## 127 126 126 127
Now we drop the old crime rate variable from the dataset. Then we divide the dataset to train and test sets, so that 80% of the data belongs to the train set.
myboston_scaled <- dplyr::select(myboston_scaled, -crim)
myboston_scaled <- data.frame(myboston_scaled, crime_category)
str(myboston_scaled)
## 'data.frame': 506 obs. of 14 variables:
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio : num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
## $ crime_category: Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
n <- nrow(myboston_scaled)
eighty_perc <- sample(n, size = n * 0.8)
train <- myboston_scaled[eighty_perc,]
test <- myboston_scaled[-eighty_perc,]
Now we fit the linear discriminant analysis on the train set. We use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. This gives us an idea how crime rate interacts with the other variables.
lda.fit <- lda(crime_category ~ ., data = train)
lda.fit
## Call:
## lda(crime_category ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2500000 0.2524752 0.2475248
##
## Group means:
## zn indus chas nox rm
## low 0.9287666 -0.8830834 -0.1164043059 -0.8748984 0.3875035
## med_low -0.1395533 -0.2570728 0.0005392655 -0.5667637 -0.2079788
## med_high -0.3930786 0.1898356 0.1136611515 0.3826808 0.1446454
## high -0.4872402 1.0171519 -0.1148450583 1.0565878 -0.4888785
## age dis rad tax ptratio
## low -0.8920625 0.8464170 -0.6885004 -0.7305705 -0.38694360
## med_low -0.3571764 0.3639241 -0.5554602 -0.4573398 -0.02839526
## med_high 0.3775655 -0.3571319 -0.3817409 -0.2864436 -0.22716881
## high 0.8065370 -0.8479290 1.6377820 1.5138081 0.78037363
## black lstat medv
## low 0.37943887 -0.752736895 0.48270313
## med_low 0.30996645 -0.098814091 -0.03747783
## med_high 0.06152535 0.004362439 0.16794902
## high -0.77090023 0.907909384 -0.70139538
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.09869028 0.85748644 -0.92684261
## indus 0.02170545 -0.28353339 0.18802796
## chas -0.09471442 -0.08336097 0.14621147
## nox 0.40959546 -0.69898490 -1.30199990
## rm -0.08844097 -0.13974689 -0.27077394
## age 0.23030287 -0.31895331 -0.09844172
## dis -0.06873529 -0.42799737 0.25350680
## rad 3.04606813 0.91047537 -0.11729830
## tax -0.09774178 -0.02820551 0.67536457
## ptratio 0.13431940 -0.01616636 -0.28617157
## black -0.12846086 0.01983506 0.11562500
## lstat 0.19772471 -0.22546177 0.37879521
## medv 0.16137340 -0.39342470 -0.15796604
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9416 0.0416 0.0168
Now we draw the LDA biplot.
lda.arrows <- function(x, myscale = 0.5, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime_category)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
Now we predict the classes with the LDA model on the test data. We can help our analysis by cross tabulating the results with the crime categories from the test set.
correct_classes <- test$crime_category
test <- dplyr::select(test, -crime_category)
We see that in our model the first three groups (low, med_low and med_high) are very close to each other. The grouping is not that obvious and could use some more thought.
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 15 11 0 0
## med_low 7 13 5 0
## med_high 1 8 15 0
## high 0 0 0 27
Let’s see how well the model actually worked. One observation I made during this analysis is that the results change every time I run the code, because the train/test split is random. The numbers may look slightly different as you read this, but I try to describe the overall picture.
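If reproducibility is wanted, the randomness can be pinned down by setting the RNG seed before making the train/test split; a sketch (the seed value 2017 is arbitrary):

```r
# Fixing the seed makes sample(), and hence the split and everything built on
# it, give the same result on every run.
set.seed(2017)  # arbitrary seed, any fixed value works
eighty_perc <- sample(n, size = n * 0.8)
```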
The number of total observations
total <- c(13+10+6+11+7+12+18+25)
total
## [1] 102
We have 102 observations.
The number of correct predictions in this instance.
correct <- c(13+18+11+25)
correct
## [1] 67
We had 67 correctly predicted observations.
wrong <- c(102-67)
wrong
## [1] 35
And 35 of the observations were incorrectly predicted.
The accuracy of our model is the following:
ratio <- c(correct/total)
ratio
## [1] 0.6568627
This time the model did OK, but as I ran the code a few times the accuracy varied somewhat (0.66 as I write this).
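Instead of summing the confusion-matrix cells by hand, the accuracy can be computed directly from the stored predictions, so the figure stays correct whatever the random split produced:

```r
# Share of test observations where the predicted class equals the true class.
mean(lda.pred$class == correct_classes)
```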
Now we reload the Boston dataset and standardize it. Then we take a look at the distances between the variables.
New_Boston <- Boston
str(New_Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
New_Boston_scaled <- scale(New_Boston) %>% as.data.frame()
str(New_Boston_scaled)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num -0.419 -0.417 -0.417 -0.416 -0.412 ...
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
Now we calculate the distances.
# fix the RNG seed for reproducibility, then compute the Euclidean distance matrix
set.seed(123)
dist_eu <- dist(New_Boston_scaled)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4620 4.8240 4.9110 6.1860 14.4000
The distances above are Euclidean, and the summary describes their distribution. Now we can start the clustering, first with 10 clusters as a safe initial guess; below I determine the optimal number of clusters.
library(ggplot2)
km <-kmeans(dist_eu, centers=10)
pairs(New_Boston_scaled, col = km$cluster)
As there are 10 clusters at the moment, it is really hard to see what actually happens in this plot.
Because of this, we optimize the number of clusters instead of guessing, and then take a look at the plot again.
k_max <- 10
wss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
The plot suggests the optimal number of clusters is 2, where the total within-group sum of squares drops most sharply. So we run the cluster analysis again and take a look at the plot.
km <-kmeans(dist_eu, centers=2)
pairs(New_Boston_scaled, col = km$cluster)
The picture above is quite similar to the one a moment ago, but now the clusters are colored in. The variables tax and rad separate the clusters most clearly compared to the plot without them.
Next we perform LDA using the clusters as target classes. We include all the variables in the Boston data in the LDA model.
km <-kmeans(dist_eu, centers = 3)
lda.fits <- lda(km$cluster~., data = myboston_scaled)
lda.fits
## Call:
## lda(km$cluster ~ ., data = myboston_scaled)
##
## Prior probabilities of groups:
## 1 2 3
## 0.1383399 0.2569170 0.6047431
##
## Group means:
## zn indus chas nox rm age
## 1 -0.4321124 1.0399621 0.62757956 1.2667725 -0.5975497 0.8264008
## 2 -0.4872402 1.0743771 -0.18147291 0.7905978 -0.3261484 0.7799367
## 3 0.3058467 -0.6943345 -0.06646762 -0.6256594 0.2752542 -0.5203916
## dis rad tax ptratio black lstat
## 1 -0.9294285 1.1838094 1.2037534 0.2897641 -1.547587007 1.0514755
## 2 -0.7234163 0.7284612 0.8572944 0.5527964 0.004362524 0.6244404
## 3 0.5199481 -0.5802831 -0.6395785 -0.3011341 0.352169811 -0.5058188
## medv crime_categorymed_low crime_categorymed_high
## 1 -0.6373069 0.01428571 0.2000000
## 2 -0.5333324 0.10000000 0.3230769
## 3 0.3723683 0.36601307 0.2287582
## crime_categoryhigh
## 1 0.7714286
## 2 0.5615385
## 3 0.0000000
##
## Coefficients of linear discriminants:
## LD1 LD2
## zn -0.39468364 -0.09677448
## indus -1.36978210 -1.03566594
## chas -0.29441343 0.53112017
## nox -0.54054935 0.56258301
## rm 0.18662157 -0.41617536
## age -0.18175313 -0.30440972
## dis -0.08851529 -0.03497269
## rad -0.30229151 -0.04071641
## tax -0.35134096 0.17638898
## ptratio -0.10630867 -0.07487010
## black 0.47176537 -1.09887695
## lstat -0.40545791 0.29469437
## medv -0.31423331 0.48310594
## crime_categorymed_low 0.61873110 -0.06078345
## crime_categorymed_high 0.09629715 -0.73535354
## crime_categoryhigh -0.08179388 -0.84651510
##
## Proportion of trace:
## LD1 LD2
## 0.9151 0.0849
Now we visualize the results with a biplot and then analyze the results.
plot(lda.fits, dimen = 2, col = classes)
lda.arrows(lda.fits, myscale = 1)
Using 3 clusters for the K-means analysis, the most influential linear separators are nox and chas. This suggests they would be the best variables for predicting new observations, should those appear.
The last thing to do is 3D modeling. It really tops off this week's project, as it gives us a clear model that we can rotate a full 360 degrees, which helps us analyse the findings and understand the model.
model_predictors <- dplyr::select(train, -crime_category)
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = classes,
        colors = c("blue","yellow"))
Let's begin Chapter 5 by loading the data.
This week's data covers a large set of countries. We look at, for example, their life expectancy, GNI, and expected years of education.
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", header = T, sep=",")
dim(human)
## [1] 155 8
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
We have 155 observations (countries) and 8 variables. Next we take a graphical overview of the data and look for correlations between the variables.
p <- ggpairs(human, mapping = aes(col="blue", alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
There are many significant correlations between the variables. For example, Mat.Mor and Ado.Birth have a correlation of 0.759.
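The full correlation matrix can also be inspected numerically, which complements the pairs plot. An illustrative snippet, not part of the original analysis:

```r
# numeric view of the pairwise correlations, rounded for readability
cor_matrix <- round(cor(human), digits = 2)
cor_matrix

# optional: visualize the matrix (assumes the corrplot package is installed)
# corrplot::corrplot(cor_matrix, method = "circle", type = "upper")
```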
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Next I chose a couple of interesting variables and tried to explain them with GNI, i.e. Gross National Income ($). I chose two response variables: life expectancy (Life.Exp, in years) and expected years of education (Edu.Exp, in years).
p2 <- ggplot(human, aes(x=GNI, y= Life.Exp)) + geom_point(col="deepskyblue1") + geom_smooth( col = "red2")
p2
## `geom_smooth()` using method = 'loess'
model5 <- lm(Life.Exp ~ GNI, human)
summary(model5)
##
## Call:
## lm(formula = Life.Exp ~ GNI, data = human)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.158 -3.759 2.058 4.801 10.654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.669e+01 7.230e-01 92.236 <2e-16 ***
## GNI 2.816e-04 2.831e-05 9.947 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.514 on 153 degrees of freedom
## Multiple R-squared: 0.3927, Adjusted R-squared: 0.3887
## F-statistic: 98.94 on 1 and 153 DF, p-value: < 2.2e-16
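To make the coefficients concrete: the model predicts a life expectancy of about 66.7 years at zero GNI, rising by roughly 2.8 years per $10,000 of GNI. A sketch computing the fitted value for an example GNI (the value 20000 is an arbitrary illustration):

```r
# fitted life expectancy at GNI = 20000, using the coefficients above:
# 66.69 + 2.816e-04 * 20000, i.e. about 72.3 years
predict(model5, newdata = data.frame(GNI = 20000))
```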
plot(model5, which=c(1), col = "turquoise2", lwd = 3)
As we can see, there are quite a few outliers, such as Qatar, that make the model hard to fit.
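One common remedy for such right-skewed predictors, not applied in this report, is to model the logarithm of GNI, which pulls in the extreme incomes. A hedged sketch of what that would look like:

```r
# refit the model with log-transformed GNI to reduce the leverage of outliers
model5_log <- lm(Life.Exp ~ log(GNI), data = human)
summary(model5_log)

# compare the residuals-vs-fitted plot with the untransformed model's
plot(model5_log, which = 1)
```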
p3 <- ggplot(human, aes(x=GNI, y= Edu.Exp)) + geom_point(col="red2") + geom_smooth(col = "royalblue1")
p3
## `geom_smooth()` using method = 'loess'
We fit the corresponding linear model for the second response, explaining education expectancy with GNI.
model6 <- lm(Edu.Exp ~ GNI, human)
summary(model6)
##
## Call:
## lm(formula = Edu.Exp ~ GNI, data = human)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4643 -1.0889 0.2784 1.4838 4.6683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.149e+01 2.471e-01 46.509 <2e-16 ***
## GNI 9.563e-05 9.673e-06 9.886 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.226 on 153 degrees of freedom
## Multiple R-squared: 0.3898, Adjusted R-squared: 0.3858
## F-statistic: 97.74 on 1 and 153 DF, p-value: < 2.2e-16
plot(model6, which=c(1), col = "turquoise2", lwd = 3)
This model also suffers a bit from the large outliers, such as Niger and Qatar.
Next we perform principal component analysis (PCA) on the non-standardized human data.
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.5, 1), col = c("black", "red2"), main="Biplot of the first two principal components for the unscaled data")
The picture doesn't look very good because we have not standardized the data: GNI has a far larger variance than the other variables, so it dominates the first principal component and most observations collapse into the same spot. Let's standardize the data and see how the situation changes.
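A quick check of the variances makes the problem concrete; this is an illustrative snippet, not part of the original analysis:

```r
# variances of the raw variables: GNI is orders of magnitude larger
# than the rest, so unscaled PCA is driven almost entirely by it
sapply(human, var)
```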
human_std <- scale(human)
pca_human11 <- prcomp(human_std)
summary(pca_human11)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
The data is now standardized, and we can draw a new biplot of the first two principal components.
biplot(pca_human11, choices = 1:2, cex = c(0.6, 1.2), col = c("black", "red2"),main="Biplot of the first two principal components for the scaled data")
Next we move on to part 2 and start by loading the tea dataset from the FactoMineR package.
data("tea")
dim(tea)
## [1] 300 36
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
I decided to drop some of the variables, as there were 36 of them; it would be quite impossible to plot and make a graphical overview with such a variety of variables.
keep <- c("friends", "Tea", "resto", "age_Q", "frequency")
teatimes <- dplyr::select(tea, one_of(keep))
str(teatimes)
## 'data.frame': 300 obs. of 5 variables:
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency: Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
g1 <- ggplot(teatimes, aes(friends)) + geom_bar() + theme_grey() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))
g2 <- ggplot(teatimes, aes(Tea)) + geom_bar()+ theme_grey() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))
g3 <- ggplot(teatimes, aes(resto)) + geom_bar() + theme_grey() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))
g4 <- ggplot(teatimes, aes(age_Q)) + geom_bar()+ theme_grey() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))
g5 <- ggplot(teatimes, aes(frequency)) + geom_bar()+ theme_grey() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))
multiplot(g1, g2, g3, g4, g5, cols=3)
mca <- MCA(X = teatimes, graph=F)
summary(mca)
##
## Call:
## MCA(X = teatimes, graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.298 0.270 0.235 0.217 0.210 0.195
## % of var. 13.547 12.257 10.703 9.861 9.545 8.847
## Cumulative % of var. 13.547 25.804 36.508 46.368 55.913 64.760
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.187 0.175 0.161 0.131 0.122
## % of var. 8.511 7.949 7.313 5.933 5.533
## Cumulative % of var. 73.271 81.221 88.534 94.467 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | 0.875 0.856 0.274 | 0.254 0.080 0.023 | -0.006 0.000
## 2 | 1.003 1.126 0.443 | -0.075 0.007 0.003 | -0.145 0.030
## 3 | -0.368 0.151 0.074 | 0.427 0.225 0.099 | -0.392 0.217
## 4 | 0.041 0.002 0.001 | -0.727 0.654 0.367 | 0.050 0.004
## 5 | 0.421 0.198 0.110 | -0.185 0.042 0.021 | 0.041 0.002
## 6 | 0.041 0.002 0.001 | -0.727 0.654 0.367 | 0.050 0.004
## 7 | -0.160 0.029 0.008 | 0.350 0.152 0.039 | 0.266 0.100
## 8 | 0.578 0.374 0.095 | 0.375 0.174 0.040 | 0.731 0.756
## 9 | 0.293 0.096 0.040 | 0.145 0.026 0.010 | 0.180 0.046
## 10 | 0.727 0.590 0.201 | 0.675 0.564 0.173 | 0.323 0.147
## cos2
## 1 0.000 |
## 2 0.009 |
## 3 0.084 |
## 4 0.002 |
## 5 0.001 |
## 6 0.002 |
## 7 0.022 |
## 8 0.152 |
## 9 0.015 |
## 10 0.040 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## friends | -0.462 9.354 0.402 -10.965 | 0.139 0.942 0.037
## Not.friends | 0.870 17.629 0.402 10.965 | -0.263 1.776 0.037
## black | 0.686 7.800 0.154 6.792 | 1.123 23.080 0.413
## Earl Grey | -0.498 10.712 0.448 -11.568 | -0.254 3.071 0.116
## green | 1.374 13.936 0.233 8.353 | -1.035 8.740 0.132
## Not.resto | 0.216 2.308 0.131 6.249 | -0.312 5.325 0.273
## resto | -0.604 6.456 0.131 -6.249 | 0.873 14.896 0.273
## 15-24 | -0.800 13.182 0.283 -9.204 | -0.412 3.866 0.075
## 25-34 | -0.170 0.447 0.009 -1.608 | -0.373 2.378 0.042
## 35-44 | 0.290 0.753 0.013 1.967 | 0.760 5.707 0.089
## v.test Dim.3 ctr cos2 v.test
## friends 3.310 | 0.059 0.191 0.006 1.394 |
## Not.friends -3.310 | -0.111 0.361 0.006 -1.394 |
## black 11.114 | 0.228 1.091 0.017 2.258 |
## Earl Grey -5.891 | -0.117 0.748 0.025 -2.717 |
## green -6.292 | 0.172 0.278 0.004 1.048 |
## Not.resto -9.029 | 0.321 6.453 0.289 9.288 |
## resto 9.029 | -0.898 18.053 0.289 -9.288 |
## 15-24 -4.741 | 0.723 13.606 0.231 8.312 |
## 25-34 -3.528 | -1.211 28.641 0.438 -11.443 |
## 35-44 5.152 | 0.241 0.659 0.009 1.636 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## friends | 0.402 0.037 0.006 |
## Tea | 0.484 0.470 0.025 |
## resto | 0.131 0.273 0.289 |
## age_Q | 0.403 0.298 0.523 |
## frequency | 0.071 0.271 0.335 |
plot(mca, invisible=c("ind"), habillage = "quali")
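As an alternative to the base plot, the factoextra package (an assumption here; it is not used elsewhere in this report) offers ggplot2-based MCA visualizations that can be easier to read:

```r
# requires the factoextra package (install.packages("factoextra"))
library(factoextra)

# variable-category map of the first two MCA dimensions,
# with labels repelled to avoid overplotting
fviz_mca_var(mca, repel = TRUE)

# scree plot of the variance explained by each dimension
fviz_screeplot(mca, addlabels = TRUE)
```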